How do I deal with content scrapers? [closed]

Posted by aem on Pro Webmasters
Published on 2012-04-04T02:45:47Z Indexed on 2012/04/04 11:41 UTC


Possible Duplicate:
How to protect SHTML pages from crawlers/spiders/scrapers?

My Heroku (Bamboo) app has been getting a lot of hits from a scraper identifying itself as GSLFBot. Googling that name turns up reports from others who've concluded that it doesn't respect robots.txt (e.g., http://www.0sw.com/archives/96).

I'm considering updating my app to keep a list of banned user-agents, answering every request from one of them with a 400 (or similar) status, and adding GSLFBot to that list. Is that an effective technique? If not, what should I do instead?
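For what it's worth, on a Rack-based Heroku app the banned-list idea can be sketched as a small middleware. This is a minimal illustration, not a vetted solution: the `BlockScrapers` class name, the `BANNED_AGENTS` list, and the choice of a 403 response (arguably a better fit than 400 for a policy rejection) are all my own assumptions, not from the question.

```ruby
# Hypothetical Rack middleware: reject requests whose User-Agent matches a
# banned pattern, pass everything else through to the app.
# BANNED_AGENTS and the 403 status are illustrative assumptions.
class BlockScrapers
  BANNED_AGENTS = [/GSLFBot/i].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    ua = env['HTTP_USER_AGENT'].to_s
    if BANNED_AGENTS.any? { |pattern| pattern.match?(ua) }
      # Short-circuit: never reaches the app, so the scraper costs almost nothing.
      [403, { 'Content-Type' => 'text/plain' }, ["Forbidden\n"]]
    else
      @app.call(env)
    end
  end
end
```

In a Rails app this would be registered with something like `config.middleware.use BlockScrapers`. The obvious caveat is that it only works as long as the scraper keeps announcing a distinctive user-agent; a bot that spoofs a browser UA needs rate limiting or IP-based blocking instead.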

(As a side note, it seems weird to have an abusive scraper with a distinctive user-agent.)

